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Where  Does  the  Data  Come  From: 
Managing  Data  Integration  with  Source  Tagging  Capabilities 

ABSTRACT  Many  important  Management  Support  Systems  require  access  to  and  seamless 
integration  of  multiple  heterogeneous  database  systems.  This  paper  studies  heterogeneous  database 
systems  from  the  source  persp>ective.  It  aims  at  addressing  issues  such  as  the  following:  (1)  Where  is 
the  data  from?  (2)  Which  intermediate  sources  were  used  to  arrive  at  that  data?  Sp)edfically,  it 
presents  a  polygen  model  for  resolving  these  data  source  and  intermediate  source  problems.  The  polygen 
model  provides  a  precise  characterization  of  the  source  tagging  problem  and  a  solution  including  a 
pHDlygen  algebra,  a  data-driven  query  translation  mechanism,  and  the  necessary  and  sufficient  condition 
for  source  tagging.  This  model  has  been  developed  as  a  direct  extension  of  the  relational  model  to  the 
multiple  database  setting  with  source  tagging  capabilities,  thus  it  enjoys  all  of  the  strengths  of  the 
traditional  relational  model.  Source  knowledge  is  imp>ortant  for  many  reasons.  It  enables  users  to 
apply  their  own  judgment  to  the  credibility  of  the  information.  It  enables  users  to  rationalize  and 
reconcile  data  inconsistencies.  It  enables  system  designers  to  develop  access  charge  systems.  It  enables 
an  application  user  to  adjust  data.  And  it  enables  a  system  to  interpret  data  semantics  more  accurately. 
In  sum,  source  tagging  capabilities  should  be  a  required  functior\ality  for  future  heterogeneous  database 
systems. 

1.     Introduction 

The  rapidly  increasing  complexity,  interdependence,  and  competition  in  the  global  market  has 
profoundly  changed  how  corporations  operate  and  how  they  align  their  information  technology  for 
competitive  advantage  in  the  marketplace.  It  has  been  argued  (Madnick,  1989)  that  improved 
communications  capability  and  data  accessibility  will  lead  to  integration  of  systems  both  within  and 
across  organizational  boundaries  in  the  1990s.  This  will  lead  to  vastly  improved  group  communications 
and,  more  importantly,  the  integration  of  business  processes  across  traditional  functional,  product,  and 
geographic  lines.  The  integration  of  business  processes,  in  turn,  will  accelerate  demands  for  more 
effective  Management  Support  Systems  for  product  development,  product  delivery,  and  customer  service 
and  management  (Rockart  &  Short,  1989).  Increasingly,  many  important  Management  Support  Systems 
require  access  to  and  seamless  integration  of  multiple  heterogeneous  database  systems.  These  types  of 
heterogeneous  database  systems  have  been  referred  to  as  Federated  Database  Systems  (Czejdo, 
Rusinkiewicz,  &  Embley,  1987;  Elmasri,  Larson,  &  Navathe,  1987;  Heimbigner  &  McLeod,  1985; 
Lyngbaek  &  McLeod,  1983),  Multidatabases  (Ferrier  &  Strangret,  1982;  Litwin  &  Abdellatif,  1986; 
Litwin,  et  al.,  1982),  or  Composite  Information  Systems  (Madnick,  Siegel,  &  Wang,  1990;  Wang  & 
Madnick,  1988). 

In  this  paper,  we  study  heterogeneous  database  systems  from  the  multiple  source  perspective. 
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In  particular,  we  address  the  following  two  issues:     (1)  Where  is  the  data  from?     (2)  Which 
intermediate  data  sources  were  used  to  arrive  at  that  data? 

It  is  interesting  to  note  that  these  issues  have  not  been  directly  addressed  to  date. 
Contemporary  heterogeneous  database  systems  strive  to  encapsulate  the  heterogeneity  of  the 
underlying  databases  in  order  to  produce  an  illusion  that  all  information  originates  from  a  single 
anonymous  source.  This  illusion  has  been  referred  to  as  location  transparency  or  location  independence 
(Date,  1990).  In  our  field  studies  of  actual  needs,  we  have  found  that  although  users  want  the 
simplicity  of  location  transparency  for  query  formulation,  they  also  want  to  know  the  source  of  each 
piece  of  data  retrieved  (e.g..  Source:  Corporate  Customer  Database).  Most  managers,  be  they 
responsible  for  marketing,  production,  or  finance,  would  not  be  much  concerned  about  how  independent 
the  data  is  from  physical  storage  or  how  distributed  and  heterogeneous  the  physical  database  systems 
are.  Their  primary  concern  about  data  is  whether  it  could  facilitate  their  decision  making  processes. 
As  such,  knowing  the  source  of  each  piece  of  data  may  be  important  to  them  for  many  reasons,  for 
example: 

•  Source  knowledge  enables  managers  to  apply  their  own  judgment  to  the  credibility  of  the 
information.  In  our  discussions  with  managers,  several  exclaimed  that  data  would  be  totally 
useless  to  them  unless  they  know  its  source. 

•  Source  knowledge  enables  managers  to  rationalize  and  reconcile  data  inconsistencies.  For 
example,  the  attribute  "Return  on  Equity"  for  Reuters  Holdings  PLC  has  different  values  when 
retrieved  from  LP.  Sharp's  Disclosure  database  (based  in  Toronto)  compared  with  Finsbury's 
Dataline  database  (based  in  London).  It  is  likely  that  different  accounting  practices  in  Canada 
and  the  United  Kingdom  would  explain  the  difference  in  values,  thus  knowing  the  source 
countries  from  which  the  financial  data  was  compiled  would  help.  Furthermore,  since  Reuters 
is  a  UK -based  company,  the  Dataline  database  may  be  more  appropriate.  In  short,  knowing 
the  data  sources  helps  managers  to  rationalize  and  reconcile  the  data  inconsistencies  as  well  as 
make  their  own  judgment. 

•  Source  knowledge  enables  a  production  irunager  to  adjust  data.  In  a  manufacturing  firm  that  we 


interviewed,  production  data  was  extracted  from  plants  across  the  country  in  order  to  produce 
production  reports.  On  the  days  when  standard  time  is  switched  to  daylight  saving  time  and 
vice  versa,  the  production  volume  could  be  less  than  the  regular  volume  because  the  aggregates 
were  based  on  23  hours  instead  of  the  regular  24  hours.  Note  that  this  does  not  necessarily 
apply  to  all  plants;  Arizona,  for  instance,  does  not  participate  in  the  daylight  saving  time 
program.  With  source  knowledge,  the  production  data  due  to  time  zone  differences  could  be 
adjusted  appropriately. 

•  Source  knowledge  enables  system  designers  to  develop  access  charge  systems.  In  a  major 
financial  institution,  analysts  have  access  to  multiple  external  commercial  databases.  With 
data  source  knowledge,  system  designers  could  develop  systems  needed  for  internal  charge  back 
schemes.  For  example,  different  charges  could  be  associated  with  data  actually  returned  to  the 
user  versus  intermediate  data  used  in  the  query  process. 

•  Source  knowledge  enables  a  system  to  interpret  data  semantics  more  accurately  and  completely. 
For  example,  it  is  typical  for  country-specific  databases  to  omit  explicit  indication  of  currency 
used.  Thus,  a  product  may  have  a  price  of  2416.95.  Is  it  in  U.S.  dollars?  Japanese  yens?  or  U.K. 
pounds?  Knowing  the  data  source  can  often  provide  the  necessary  clarification.  In  this  case,  if 
the  data  source  is  Japan,  then  the  system  can  assume  that  the  currency  is  in  yen  unless  it  is  an 
exception.  This  information  can  be  used  in  conjunction  with  other  rules  to  determine  the  data 
semantics  correctly. 

Indeed,  it  has  been  suggested^  that  knowing  the  data  source  is  so  important  that  it  should  be  a 
required  feature  for  heterogeneous  database  systems.  Providing  source  tagging  capabilities  for 
heterogeneous  database  systems  requires  an  understanding  of  the  constraints  involved  both 
organizationally  and  technically. 

Most  organizations  must  deal  with  pre-existing  information  systems  which  have  been 
developed  and  administered  independently,  and  are  likely  to  remain  so.    Many  of  these  systems  are 


One  of  our  technical  colleagues  has  suggested  that  data  source  tagging  capabilities  be  added  as  Date's 
thirteenth  rule  for  distributed  database  systems  (Date,  1990). 


controlled  by  autonomous  subsidiaries  or  even  separate  corporations  (e.g.,  Dow  Jones  financial  services) 
that  are  reluctant  or  unwilling  to  change  their  systems.  This  implies  that  one  should  not  require  data 
to  be  augmented  in  a  pre-existing  information  system  in  order  to  allow  for  data  integration.  We  resolve 
this  problem  by  tagging  data  source  after  the  data  has  been  retrieved  from  a  local  database. 

Technically,  it  is  important  to  develop  source  tagging  capabilities  based  on  previous  database 
research  results.  It  would  enable  us  to  enjoy  all  of  the  strengths  of  the  conventional  database  systems, 
such  as  the  capabilities  to  allow  for  data  sharing  by  multiple  users  concurrently  and  to  remove  many  of 
the  file  handling  details  from  the  concern  of  application  programmers.  In  order  to  accomplish  this,  we 
have  to  understand  the  trade-offs  of  different  data  models,  the  mechanisms  used  in  these  models  to 
perform  data  definition  and  data  manipulation,  and  develop  a  new  algebra  and  a  query  processing 
mechanism  for  facilitating  source  tagging  capabilities.  Our  research  contributions  can  be  summarized  as 
follows: 

(1)  We  have  developed  a  polygen  modefi  to  study  heterogeneous  database  systems  from  the  multiple 
(poly)  source  (gen)  perspective.  The  polygen  model  provides  a  precise  characterization  of  the 
source  tagging  problem  and  a  solution  including  a  polygen  algebra,  a  data-driven  query  translation 
mechanism,  and  the  necessary  and  sufficient  condition  for  source  tagging.  A  concrete  example  is  also 
provided  to  illustrate  the  basic  mechanism. 

(2)  We  have  developjed  the  polygen  model  as  a  direct  extension  of  the  relational  model  to  the  multiple 
database  setting  with  source  tagging  capabilities,  thus  the  polygen  model  enjoys  all  of  the 
strengths  of  the  traditional  relational  model. 

(3)  We  have  established  a  theoretical  foundation  for  resolving  many  other  critical  research  issues. 
For  example,  the  polygen  algebra  can  be  extended  to  address  other  basic  attributes  associated  with 
data,  such  as  the  temporal  aspect  of  data.  Users  normally  want  to  know  not  only  where  the  data  is 


To  highlight  the  source  tagging  problems,  the  phrase  "polygen  model"  will  be  used  in  the  paper  instead 
of  the  conventional  "global  model."  By  the  same  token,  "polygen  query"  will  be  used  instead  of  "global 
query,"  and  so  on,  and  so  forth. 


from  but  also  when  the  data  was  collected  and  how  it  was  collected.  Furthermore,  as  we  motivated 

earlier,  knowing  the  data  source  will  enable  a  user  or  a  query  processor  to  interpret  the  data 

semantics  more  accurately;  knowing  data  source  credibility  will  enable  the  user  or  the  query 

processor  to  hjrther  resolve  potential  conflicts  amongst  the  data  retrieved  from  different  sources; 

and  knowing  data  access  cost  will  enable  system  designers  to  develop  access  charge  systems. 

n  SOURCE  TAGGING  EXAMPLE 

In  preparing  a  special  rejxjrt^  on  the  top  ten  graduate  programs  in  Information  Systems,  a 

member  of  the  ComputerWorld  staff  called  one  of  the  schools  to  get  the  names  of  CEO's  who  graduated 

from  the  school  wUh  an  MBA  degree.  Suppose  that  the  following  SQL  polygen  query 

SELECT  ONAME,  CEO 

FROM  PORGANIZATION,  PALUMNUS 

WHERE  CEO  =  ANAME  AND  DECREE  =  "MBA" 

was  created  given  a  polygen  schema  derived  from  the  Alumni  Database  and  Company  Database  below. 

For  expository  purposes,  the  prefix  "P"  is  used  to  denote  a  polygen  scheme  in  the  p>olygen  schema. 


Polygen  Schema 

Alumni  Database  (AD): 
Alumni  Schema 

Company  Database  (CD); 
Company  Schema 

PORGANIZATION(ONAME.  INDUSTRY, 
CEO,  HEADQUARTERS) 

BUSINESStBNAME.  IND) 

FIRM(FNAME,  CEO,  HO) 

PHNANCE  (ONAME,  YEAR,  PROFIT) 

HNANCECFNAME.  YR.  PRORT) 

PALUMNUS(AID#,  ANAME, 
DEGREE,  MAJOR) 

ALUMNUS(AID#,  ANAME, 
DEG,  MAJ) 

PCAREER  (A1D#,  ONAME^ 

CAREER(AID«,  BNAME) 

In  the  table  above,  a  firm  in  the  Company  Database  has  a  name,  a  CEO,  and  is  headquartered 
in  a  city.  It  discloses  yearly  financial  information  on  profit.  Each  alumnus  in  the  Alumni  Database  is 
uniquely  identified  through  an  alumnus  identification  number  (AIDtt).  Associated  with  each  alumnus 
is  a  name,  a  degree,  and  a  major.  An  alumnus  may  have  careers  in  many  businesses,  and  each  business  is 
associated  with  an  industry.  Attribute  mapping  relationships  between  the  polygen  schema  and  the 
Alumni  Database  and  the  Company  Database  are  shown  in  Section  2. 

The  query  result  contains  only  the  names  of  the  CEOs  which  originated  from  the  Company 
Database,  but  the  query  processor  also  needs  to  access  the  Alumni  Database  (an  intermediate  source)  in 
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order  to  select  those  CEOs  who  received  an  MBA  degree.  Moreover,  the  query  processor  needs  to  'Tcnow" 
that  it  has  to  merge  the  BUSINESS  and  the  FIRM  relations  first  before  joining  the  CEO  attribute  with 
the  ANAME  attribute.  As  such,  the  challenge  is  to  develop  not  only  a  polygen  model  but  also  a  pwlygen 
algebra  and  the  algorithms  for  a  polygen  query  processor  capable  of  resolving  the  data  and 
intermediate  source  tagging  problems  for  any  arbitrary  polygen  query.  Tagging  the  Company  Database 
name  accurately  to  the  result  is  referred  to  as  the  Data  Source  Tagging  problem.  Tagging  the 
intermediate  use  of  the  Alumni  Database  accurately  is  referred  to  as  the  Intermediate  Source  Tagging 
problem. 

1^  RESEARCH  ISSUES  AND  GOALS 

We  have  reviewed  a  broad  range  of  literature  and  examined  various  research  prototypes  of 
heterogeneous  distributed  database  systems.  The  systems  that  we  studied  (Gupta,  1989;  Wang  & 
Madnick,  1988)  included  MULTIBASE  in  the  United  States  (Smith,  et  al.,  1981),  PRECI*  in  England 
(Deen,  Amin,  &  C,  1987;  Deen,  Amin,  &  Taylor,  1987),  and  MRDSM  in  France'^  (Litwin  &  Abdellatif, 
1986;  Litwin,  et  al.,  1982).  In  addition,  we  have  surveyed  more  than  forty  U.S.  commercial  systems 
offering  partial  solutions  to  the  heterogeneous  distributed  database  problem,  including  Data 
Integration's  MERMAID,  Cincom's  SUPRA,  Metaphor's  DIS,  and  TRW's  Data  Integration  Engine 
(Gupta,  et  al.,  1989).  To  the  best  of  our  knowledge,  none  of  these  systems  have  dealt  with  these  source 
tagging  problems. 

Two  related  issues,  among  others,  need  to  be  addressed  in  source  tagging:  (1)  What  kind  of 
polygen  model  should  be  created  in  order  to  tag  multiple  sources  explicitly?  (2)  What  is  the 
relationship  between  the  polygen  model  and  the  polygen  query  processing  facility? 

Most  heterogeneous  distributed  database  systems  adopt  one  of  the  following  four  data  models 
(Hull  &  King,  1987;  Peckham  &  Maryanski,  1988):  the  Relational  Model,  the  Functional  Data  Model, 
the  Semantic  Database  Model,  or  the  Entity  Relationship  Model.    Each  data  model  has  merits  for  its 


In  MRDSM,  an  administrator  may  define  for  any  collection  of  databases  a  collective  name  called  a 
multidatabase  name.  For  instance,  the  databases  Michelin,  Kleber,  and  Gault_M  may  collectively  get 
the  name  Rest_guides.  However,  the  focus  of  such  names  is  to  simplify  the  expression  of  some 
commands;   otherwise,  these  commands  may  require  an  enumeration  of  the  corresponding  databases. 


intended  purposes.  For  example,  both  the  Functional  Data  Model  and  the  Sennantic  Database  Model 
are  rich  in  semantics  and  implemented  in  operational  systems.  The  Entity  Relationship  Model  is  also 
rich  in  semantics  and  is  widely  accepted  as  the  leading  database  design  tool.  The  relational  model 
lends  itself  to  a  simple  structure  and  an  elegant  theoretical  foundation.  In  addition,  Relational  Data 
Base  Management  Systems  dominate  the  database  market  today.  Moreover,  the  relational  model  has 
been  extended  (Codd,  1979)  to  capture  semantics  such  as  generalization  and  aggregation.  In  order  to 
consider  both  of  the  rigorous  and  pragmatic  asp)ects,  we  selected  the  relational  model.  Based  on  the 
relational  model,  we  define,  in  this  paper,  a  polygen  model  for  resolving  the  data  and  intermediate 
source  tagging  problems. 

One  of  the  key  activities  in  formulating  composite  information  is  to  translate  a  p>olygen  query 
into  a  set  of  local  queries,  which  in  turn  are  routed  to  the  corresponding  local  databases.  Query 
translation  has  been  approached  through  view  definition  in  most  heterogeneous  distributed  database 
systems  (Breitbart,  Olson,  &  Thompson,  1986;  Brill,  Templeton,  &  Yu,  1984;  Dayal  &  Hwang,  1984; 
Deen,  Amin,  &  C,  1987;  Deen,  Amin,  &  Taylor,  1987;  Ferrier  &  Strangret,  1982;  Katz  &  Goodman,  1981; 
Litwin  &  Abdellatif,  1986;  Litwin,  et  a!.,  1982;  Templeton,  et  al.,  1983).  A  symbolic  query 
transformation  technique  has  also  been  proposed  (Rusinkiewicz  &  Czejdo,  1985;  Rusinkiewicz,  et  al., 
1988)  in  which  a  syntax-directed  parser  converts  a  polygen  query  and  transformation  rules^  into 
multiway  trees.  Through  subtree  matching,  these  multiway  trees  are  further  translated  into  local 
queries,  given  the  sp>ecific  source  and  target  language  syntax  descriptions. 

As  we  will  discuss  later,  our  query  translation  mechanism  differs  from  the  above  mentioned 
techniques  in  two  important  aspects:  (1)  Instead  of  the  view  definition  approach  which  encodes  the 
procedure  for  translating  a  polygen  query  into  the  corresponding  local  queries,  our  mechanism  separates 
the  mapping  algorithm  from  the  mapping  data.  As  a  result,  adding  a  new  database  to  the  existing 
system  does  not  require  modifying  the  existing  procedural  view  definitions.  (2)  Instead  of  the  symbolic 
query  transformation  technique  which  tackles  a  broad  range  of  nodal  query  languages  at  a  higher  level. 

Each  transformation  rule  contains  a  source  part  and  a  target  part.   For  example. 

Source:  SELECT  attnbute-l     FROM  relation-1     WHERE  condirion; 

Target:-  Projection     ((attribute-!).  Selection  (condition,  (relation-1)); 


our  mechanism  focuses  on  the  mapping  between  a  polygen  algebraic  expression  and  the  corresponding 
local  operations,  pjermitting  entities  (and  attributes)  in  local  databases  to  overlap  one  another. 

1.3  RESEARCH  BACKGROUND  AND  ASSUMPTIONS 

We  have  developed  a  heterogeneous  database  system  which  currently  has  access  to  three 
internal  databases  (the  Alumni  Database,  the  Placement  Database,  and  the  Student  Database)  and 
three  external  commercial  databases  (Finsbury's  Dataline  and  LP.  Sharp's  Disclosure  and  Currency). 
The  query  processor  architecture  is  depicted  in  Figure  1.  Briefly,  the  Application  Query  Processor 
translates  an  end-user  query  into  a  polygen  query  for  the  Polygen  Query  Processor  (PQP)  based  on  the 
user's  application  schema.  The  word  "polygen"  is  used  here  to  signify  that  the  query  processor  is 
equipped  with  source  tagging  capabilities.  The  PQP  in  turn  translates  the  polygen  query  into  a  set  of 
local  queries  based  on  the  corresponding  polygen  schema,  and  routes  them  to  the  Local  Query  Processors 
(LQP).  The  details  of  the  mapping  and  communication  mechanisms  between  an  LQP  and  its  local  data 
bases  is  encapsulated  in  the  LQP.  To  the  PQP,  each  LQP  behaves  as  a  local  relational  system.  Upon 
return  from  the  LQPs,  the  retrieved  data  are  further  processed  by  the  PQP  in  order  to  produce  the 
desired  composite  information. 

Many  critical  problems  need  to  be  resolved  in  order  to  provide  a  seamless  solution  to  the  end- 
user.  These  problems  include  source  tagging,  query  translation,  schema  integration  (Batini,  Lenzirini,  & 
Navathe,  1986;  Elmasri,  Larson,  &  Navathe,  1987),  inter-database  instance  matching  (Wang  & 
Madnick,  1989b),  domain  mapping  (DeMichiel,  1989;  Shin,  1988),  and  semantic  reconciliation  (Wang  & 
Madnick,  1989a).  We  focus  on  the  first  two  problems  and  make  the  following  assumptions  in  this  paper: 

•  The  data  source  is  tagged  after  the  data  has  been  retrieved  from  a  local  database. 

•  The  local  schemata  and  the  polygen  schema  are  all  based  on  the  relational  model. 

•  Schema  integration  has  been  performed,  and  the  attribute  mapping  information  is  stored  in  the 
polygen  schema. 

•  The  inter-database  instance  identifier  mismatching  problem  (e.g.,  IBM  vs.  I.B.M  or  social 
security  identification  number  vs.  employee  identification  number)  has  been  resolved  and  the 
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information  is  available  for  the  PQP  to  use.  A  discussion  of  this  issue  has  been  presented 
elsewhere  (Wang  &  Madnick,  1989b). 

The  domain  mismatch  problem  such  as  unit  ($  vs.  V),  scale  (in  billions  vs.  in  millions),  and 
description  interpretation  ("expensive"  vs.  "$$$",  "Chinese  Cuisine"  vs.  "Hunan  or  Cantonese") 
has  been  resolved  during  schema  integration  and  the  information  is  also  available  to  the  PQP. 
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Query 


DBMS 
Result 


Figure  1 :  The  Query  Processor  Architecture 


Section  2  defines  the  polygen  model.  Polygen  query  translation  is  presented  in  Section  3. 
Section  4  provides  a  detailed  example  of  the  basic  polygen  query  processing  mechanism.  The  necessary 
and  sufficient  condition  of  source  tagging  is  presented  in  Section  5.  Finally,  concluding  remarks  are  made 
in  section  6. 


2.     The  Polygen  Model 

To  pre^nt  the  polygen  model  more  concretely,  we  first  exemplify  the  attribute  mapping 
relationships  between  the  polygen  schema  and  their  corresponding  local  schemata  in  the  form 
{(database,  relation,  attribute),...)  for  the  source  tagging  example  described  in  Section  1. 

The  PORGANIZATION  Polygen  Scheme 


ONAME 


INDUSTRY 


CEO 


HEADQUARTERS 


((AD,  BUSINESS,  BNAME), 
(CD,  RRM,  FNAME)) 


1(AD,  BUSINESS,  IND)) 


KCD,  RRM,  CEO)) 


((CD,  RRM,  HQ)) 


The  PRNANCE  Polygen  Scheme 

ONAME 

1                        YEAR                       1                       PROFIT 

((CD,  FINANCE,  FKAME)) 

1         ((CD,  FINANCE,  YR))          |       ((CD,  FINANCE,  PROFIT)) 

The  PALUMNUS  Polygen  Scheme 


AID#  ANAME  DEGREE  MAIOR 


((AD,  ALUMNUS,  AID#»      |     ((AD.  ALUMNUS.  ANAME))     |    ((AD,  ALUMNUS,  DEC))    |    ((AD,  ALUMNUS,  MAJ)) 


The  PCAREER 

Polvgen  Scheme 

A1D# 

1 

ONAME 

((AD, 

CAREER, 

A1D#))     1 

((AD, 

CAREER,  BNAME)) 

We  now  define  the  polygen  model.  Let  PA  be  a  polygen  attribute  in  a  polygen  scheme  P,  LS  a 
local  scheme  in  a  local  database  LD,  and  LA  a  local  attribute  in  LS.  For  example,  ONAME  is  a  p>olygen 
attribute  in  the  polygen  scheme  PORGANIZATION,  BUSINESS  a  local  scheme  in  the  local  database 
AD,  and  BNAME  a  local  attribute  in  the  local  scheme  BUSINESS. 

Let  MA  be  the  set  of  local  attributes  correspKjnding  to  a  PA,  i.e., 

MA  =  {(LD,  LS,  LA)  I  (LD,  LS,  LA)  denotes  a  local  attribute  to  the  corresponding  PA). 

For  ONAME  in  the  PORGANIZATION  polygen  scheme,  MA  =  {(AD,  BUSINESS,  BNAME), 
(CD,  HRM,  FNAME)). 

A  polygen  scheme  P  is  defined  as 

P  =  <(PAi,  MAi), . . . ,  (PA„,  MAfl))  where  n  is  the  number  of  attributes  in  P. 

For  the  polygen  scheme  PORGANIZATION  in  the  above  scenario, 

PORGANIZATION  =  ((ONAME,  |(AD,  BUSINESS,  BNAME),  (CD,  HRM,  FNAME))),  (INDUSTRY,  ((AD, 
BUSINESS,  IND)]),  (CEO,   {(CD,  HRM,  CEO))),  (HEADQUARTERS,  {(CD,  HRM,  HQ)))) 

A  polygen  schema  is  defined  as  a  set  {Pi, . . . ,  P^;}  of  N  polygen  schemes.  In  the  above  scenario, 

the  polygen  schema  consists  of  the  following  schemes: 
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{PORGANIZATION,  PnNANCE,PALUMNnjS,  PCAREER} 
A  fX)lygen  domain  is  defined  as  a  set  of  ordered  triplets.  Each  triplet  consists  of  three  elements: 
the  first  is  a  datum  drawn  from  a  simple  domain  in  an  LQP.  The  second  is  a  set  of  LDs  denoting  the  local 
databases  from  which  the  datum  originates.   The  third  is  a  set  of  LDs  denoting  the  intermediate  local 
databases  whose  data  led  to  the  selection  of  the  datum. 

A  jX)lygen  relation  p  of  degree  n  is  a  finite  set  of  time-varying  n-tuples,  each  n-tuple  having  the 
same  set  of  attributes  drawing  values  from  the  corresponding  polygen  domains.  A  cell  in  a  polygen 
relation  is  an  ordered  triplet  c=(c(d>,  c(o),  c(i))  where  c(d>  denotes  the  datum  portion,  c(o)  the  originating 
portion,  and  c{\)  the  intermediate  source  portion.  Two  polygen  relations  are  union<ompatible  if  their 
corresponding  attributes  are  defined  on  the  same  polygen  domain. 

Note  that  P  contains  the  mapping  information  between  a  fx)lygen  scheme  and  the  corresponding 
local  relational  schemes.  In  contrast,  p  contains  the  actual  time-varying  data  and  their  originating 
sources.  A  polygen  scheme  P  and  a  polygen  relation  p  may  be  used  synonymously  without  corxfusion.  The 
data  and  intermediate  source  tags  for  p  are  updated  along  the  way  as  polygen  algebraic  operations  are 
f)er  formed. 

2.1  THE  POLYGEN  ALGEBRA 

Let  attrs(p)  denote  the  set  of  attributes  in  p.  For  each  tuple  t  in  a  polygen  relation  p,  let  t(d) 
denote  the  data  pwrtion,  t(o)  the  originating  source  portion,  and  t(i)  the  intermediate  source  portion.  If 
X  6  attrs(p),  X  =  {xi...,x,,...,xj)  is  a  sublist  of  attrs(p),  then  let  p(x)  be  the  column  in  p  corresf)onding  to 
attribute  x,  let  pKX)  be  the  columns  in  p  corresponding  the  sublist  of  attributes  X,  let  t(x)  be  the  cell  in  t 
corresponding  to  attribute  x,  and  let  t(X)  be  the  cells  in  t  corresponding  to  the  sublist  of  attributes  X.  As 
such,  p(x)(o>  denotes  the  originating  source  portion  of  the  column  corresponding  to  attribute  x  in  polygen 
relation  p  while  t(XXi)  denotes  the  intermediate  source  portion  of  the  cells  corresponding  to  the  sublist 
of  attributes  X  in  tuple  t.  On  the  other  hand,  p(x)  denotes  the  column  corresponding  to  attribute  x  in 
polygen  relation  p  inclusive  of  the  data,  originating  source,  and  intermediate  source  portions  while  t(X) 
denotes  the  cells  corresponding  to  the  sublist  of  attributes  X  in  tuple  t  inclusive  of  the  data,  originating 
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source,  and  intermediate  source  portions.    Note  that  the  "(  )"  notation  in  project  p(X)  should  not  be 
confused  with  the  operation  f>(x=y). 

It  has  been  shown  (Maier,  1983)  that  in  a  conventional  relational  system,  a  relational  algebra 
can  be  defined  through  five  orthogonal  algebraic  operators.    Here  we  define  the  five  orthogonal 
algebraic  operators  in  the  context  of  our  polygen  model: 
Project.  If  p  is  a  p>olygen  relation,  and  X  =  {xi...,x,,...,xj)  is  a  sublist  of  attrs(p),  then 

p{X)  =  {f  I  f  =  t(X)  if  t6  p  A  t(X)<d>  is  unique; 

t'<d)=ti(X){d),  t'(xj)<o>=  ti(xj)(o>  U...U  tk(xj)(o>  V  xj  e  X  t'(xj)<i>=  ti(xj){i>  u...u  tk(xj)<i>  V  xj  e  X 
if  ti , .. .  tk  ep  A  ti(X)<d>=...=  tk(XKd)). 

The  above  expression  specifies  that  if  the  data  portion  of  a  projected  tuple  is  unique,  then  the 
originating  source  portion  and  the  intermediate  source  portion  is  identical  to  the  those  of  before  the 
projection.  This  is  correct  because  in  this  case,  only  one  piece  of  data  is  involved.  On  the  other  hand,  if 
k  tuples  has  the  same  projected  data  portion,  then  the  expression  spjecifies  that  the  operator  will  take 
any  one  of  the  data  portion  of  these  tuples  to  be  the  projected  data  portion  (since  they  are  the  same), 
and  take  the  union  of  the  originating  source  portions  as  the  new  original  source  portion  for  each  of  the 
cells  in  the  projected  relation.  By  the  same  token,  the  project  operator  will  take  the  union  of  the 
intermediate  source  portions  as  the  new  intermediate  source  portion  for  each  of  the  cells  in  the  projected 
relation.  This  is  also  correct  because  the  projected  data  have  been  drawn  from  all  the  originating 
sources  and  have  been  derived  with  the  involvement  of  all  the  intermediate  sources. 
Cartesian  product.  If  p,  and  p)2  are  two  polygen  relations,  then 
(pi  xp2)  =  (tj  °  t2  1  ti   e  pv]    and  t2   €  p2  where  °  denotes  concatenation). 

The  above  expression  specifies  that  each  tuple  in  p,  is  concatenated  with  every  tuple  in  p2 

following  the  definition  of  the  Cartesian  product.    Since  no  data  items  are  merged  in  this  case,  the 

originating  source  and  intermediate  source  portions  remain  to  be  the  same. 

Restrict.  If  p  is  a  polygen  relation,  x  €  attrs(p),  y  e  attrs(p),  and  0  is  a  binary  relation,  then 

p(x  e  y)  =  (f  I  t'(d>  =  t<d),  t'(o)  =  t(o),  f(w)(i)  =  t(w)<i>  u  t(x)(o>  u  t(y)(o)  V  w  €  attrs(p), 
iftep  A  t(x)<d)  e  t(y)<d>}. 
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The  above  expression  specifies  that  for  each  of  the  tuple  in  p  that  satisfies  the  9  relation,  the 
restrict  operator  will  update  the  intermediate  source  portion  to  include  the  originating  sources  of  t(x) 
and  t(y)  because  they  are  used  to  produce  the  new  polygen  relation.  The  originating  source  portion  is  not 
effected  because  the  data  portion  is  unique  (we  assume  that  duplicate  tuples  are  treated  as  separate 
tuples).  Since  Select  and  Join  are  defined  through  Restrict,  they  also  upxiate  t(i>. 
Union.  If  pi  and  p2  are  two  polygen  relations  and  both  have  degree  n,  ti  e  pi,  t2  g  pj,  then 

(pi  u  P2 )  =  (f  I  t  -  ti  if  ti<d>e  pi  A  t,(d>«  P2; 
t'=t2  if  t2<d)  €  p,  A  t2<d>  6  P2; 
t'<d>=t,<d>,  t'(o>=t,(o>  u  t2<o>,  tXi>=t,<i>  u  t2<i>  if  t,(d)=t2<d» 

The  above  expression  specifies  that  for  the  tuples  that  exist  in  only  one  polygen  relation,  the 
union  operator  will  copy  it  over  to  be  the  new  tuple.  On  the  other  hand,  if  the  data  portion  of  two 
tuples,  ti(d>  and  t2{d),  are  identical,  then  the  operator  will  copy  the  data  portion  over  to  the  data 
portion  of  the  new  tuple,  and  take  the  union  of  the  originating  sources  to  be  the  originating  sources  for 
each  of  the  cells  in  the  new  tuple.  By  the  same  token,  the  operator  will  take  the  union  of  the 
intermediate  source  portions  as  the  new  intermediate  source  portion  for  each  of  the  cells  in  the  new 
tuple. 

Difference.  Let  p<o)  denote  the  union  of  all  the  t(o>  sets  in  p,  and  p<i)  denote  the  union  of  all  the  t<i)  sets 
in  p.  If  pi  and  p2  are  two  polygen  relations  and  both  have  degree  n,  then 

(Pi  -  P2)  =  (f  I  t'(d>=  t<d),  t'(o>  =   t(o>,  t'(w)<i)  =  t(w)<i>u  p2(o)  u  p2<i>  V  w  e  attrs(p),  if  t  e  pi  and 
t(d>ep2). 

Difference  selects  a  tuple  in  pi  to  be  a  tuple  in  (p,  -  P2)  if  the  data  portion  of  the  tuple  in  p^  is 
not  identical  to  those  of  the  tuples  in  p>2.  Since  each  tuple  in  pi  needs  to  be  compared  with  all  the  tuples 
in  P2,  it  follows  that  all  the  originating  sources  of  the  data  in  p2  should  be  included  in  the  intermediate 
source  set  of  (pi  -  P2),  as  t'(i)  =  t(i)  u  p2(o>  u  p)2(i>  denotes. 

Other  traditional  operators  can  be  defined  in  terms  of  the  above  five  operators.  The  most 
common  are  Join,  Select,  and  Intersection.  Join  and  Select  are  defined  as  the  restriction  of  a  Cartesian 
product.  Intersection  is  defined  as  the  project  of  a  join  over  all  the  attributes  in  each  of  the  relations 
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involved  in  the  Intersection. 

In  order  to  process  a  pwlygen  query,  we  also  need  to  introduce  the  following  new  op>erators  to  the 
polygen  model:  Retrieve,  Coalesce,  Outer  Natural  Primary  join,  Outer  Natural  Total  Join,  and  Merge. 

A  local  database  relation  needs  to  be  retrieved  from  a  local  database  to  the  PQP  first  before  it 
is  considered  as  a  PQP  base  relation.  This  is  required  in  the  polygen  model  because  a  pwlygen  operation 
may  require  data  from  multiple  local  databases.  Although  a  PQP  base  relation  can  be  materiaHzed 
dynamically  like  a  view  in  the  conventional  database  system,  for  conceptual  purposes,  we  define  it  to 
reside  physically  in  the  PQP.^  The  Retrieve  operation  is  defined  as  an  LQP  Restrict  of>eration  without 
any  restricting  condition. 

Coalesce  and  Outer  Natural  Join  have  been  informally  introduced  by  Date  to  handle  a 
surprising  number  of  practical  applications.  Coalesce  takes  two  columns  as  input,  and  coalesce  them  into 
one  column.  An  Outer  Natural  Join  is  an  outer  join  with  the  join  attributes  coalesced  (Date,  1983). 

We  define  an  Outer  Natural  Primary  Join  as  an  Outer  Natural  Join  on  the  primary  key  of  a 
polygen  relation.  For  example,  the  Outer  Natural  Primary  Join  for  PORGANIZATION  is  an  Outer 
Natural  Join  on  ONAME.  An  Outer  Natural  Total  Join  is  an  Outer  Natural  Primary  Join  with  all  the 
other  polygen  attributes  in  the  polygen  relation  coalesced  as  well.  In  the  PORGANIZATION  example, 
an  Outer  Natural  Total  Join  would  perform  an  Outer  Natural  Primary  Join  on  ONAME  followed  by  a 
number  of  Coalesce  operations  on  INDUSTRY,  CEO,  and  HEADQUARTERS.  Merge  extends  Outer 
Natural  Total  Join  to  include  more  than  two  px)lygen  relations.  It  can  be  shown  that  the  order  in  which 
Outer  Natural  Total  Join  are  performed  over  a  set  of  polygen  relations  in  a  Merge  is  imnrvaterial. 

Since  Coalesce  can  be  used  in  conjunction  with  the  other  polygen  algebraic  operators  to  define 
the  Outer  Natural  Primary  Join,  Outer  Natural  Total  Join,  and  Merge,  we  define  Coalesce  as  the  sixth 
orthogonal  primitive  of  the  piolygen  model. 

Coalesce.    Let  ©  denote  the  coalesce  operator.   If  p  is  a  polygen  relation,  x  e  attrs(p),  y  €  attrs(p),  z  = 
attrs(p)  -  (x,  y},  and  w  is  the  coalesced  attribute  of  x  and  y,  then 


This  approach  simplifies  the  Polygen  Operation  Interpreter,  to  b>e  presented  in  Section  III. 
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p(x  ®  y:w)  = 

(f  I  f(z)=t(z),  f(w)<d>=t(x)(d>,  t'(w)<o>  =t(xKo)  ut(y)(o> ,  f(w)<i)  =t{x)<i>  ut(y)<i> ,  if  t(x)<d>=t(y)(d); 

f(z)=t(z),  f(wKd>=t(x)<d>,  f(w)<o>  =t(x)<o) ,  f(w)<i)  =t(x)(i)  ,  if  t(yKd)=niI; 

f(z)=t(z),  t'(w)<d>=t(yKd>,  f(w)<o>  =t(y)(o> ,  t'(wKi>  =t(y)<i> ,  if  t(x)<d)=nil}. 

The  above  expression  specifies  that  attribute  x  and  attribute  y  will  be  coalesced  into  a  new 
attribute  called  w.  If  both  of  the  data  portion  of  attribute  x  and  attribute  y  exist  (and  by  the  definition 
of  coalesce,  they  must  have  the  same  value),  then  the  coalesce  operator  w^ill  copy  the  data  portion  of 
the  cell  over  to  attribute  w,  and  take  the  union  of  the  originating  sources  to  be  the  originating  sources  of 
the  cell  in  the  new  tuple.  By  the  same  token,  the  operator  will  take  the  union  of  the  intermediate 
source  portions  as  the  new  intermediate  source  portion  of  the  cell  in  the  new  tuple.  For  those  tuples  that 
either  the  data  portion  of  attribute  x  or  attribute  y  does  not  exist,  the  of)erator  will  copy  the  cell  with 
data  over  to  the  new  tuple.  Note  that  in  a  heterogeneous  distributed  environment,  the  data  values  to  be 
coalesced  may  be  inconsistent.  It  is  assumed  that  inter-database  instance  mismatching  problems  (Wang 
&  Madnick,  1989b)  will  be  resolved  before  the  coalesce  operation  is  performed. 

We  have  presented  the  polygen  model  and  the  polygen  algebra.  The  algebra  will  be  used  in 
Section  4  to  compose  information  with  data  source  tags  and  intermediate  source  tags.  In  order  to  do  that, 
it  is  necessary  to  know  the  process  of  translating  a  polygen  query  into  a  query  execution  plan.  This 
process  is  presented  below. 

3.     Polygen  Query  Translation 

For  the  SQL  polygen  query  presented  in  Section  1,  a  corresponding  polygen  algebraic  expression 

for  the  SQL  polygen  query  is  as  follows: 

PALUMNUS  (DEGREE  =  "MBA")  (ANAME  =CEO)  PORGANIZATION  (ONAME,  CEO) 
In  this  expression,  those  alumni  with  an  MBA  degree  are  selected  from  the  PALUMNUS 

relation.   The  result  is  joined  with  the  PORGANIZATION  relation  where  an  alumnus  is  also  a  CEO, 

followed  by  a  projection  on  ONAME  and  CEO. 
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In  general,  the  PQP  takes  a  polygen  algebraic  expression  as  an  input  and  produces  a  query 
execution  plan  for  retrieving  data  from  the  local  databases  and  formulating  composite  information. 
Three  components  are  involved  in  this  process:  the  Algebraic  Analyzer,  the  Polygen  Operation 
Interpreter,  and  the  Query  Optimizer,  as  shown  in  Figure  2. 


Polygen 

Algebraic 

Expression 


Polygen 

Operation 

Matrix 


Intermediate 

Operation 

Matrix 


Query 

Execution 

Plan 


Figure  2:  The  Polvgen  Querv  Translation  Process 


The  Algebraic  Analyzer  parses  a  polygen  algebraic  expression  and  generates  a  Polygen 
Operation  Matrix.  For  example,  the  Polygen  Operation  Matrix  for  the  example  polygen  algebraic 
expression  is  presented  in  Table  1  below.  The  first  row  indicates  that  a  Select  operation  should  be 
performed  on  the  Left-Hand  Relation  (LHR)  PALUMNUS  using  the  9  relation  "="  between  the  Left- 
Hand  Attribute  (LHA)  DEGREE  and  the  Right-Hand  Attribute  (RHA)  "MBA."  In  this  case,  there  is  no 
need  for  a  Right-Hand  Relation  (RHR).  The  result  is  denoted  by  R(l),  a  Polygen  Relation  (PR). 
Details  of  the  Algebraic  Analyzer  is  beyond  the  scope  of  this  paper. 

Table  1:  The  Polygen  Operation  Matrix  for  the  Example  Polygen  Algebraic  Expression 


PR 

OP 

LHR 

LHA 

9 

RHA 

RHR 

R(l) 

Select 

PALUMNUS 

DEGREE 

= 

"MBA" 

nil 

R(2) 

Join 

R(l) 

ANAME 

= 

CEO 

PORGANIZATION 

R(3) 

Project 

R(2) 

ONAME,  CEO 

nil 

ml 

nil 

Next  the  Polygen  Operation  Interpreter  expands  the  Polygen  Operation  Matrix  and  generates 
an  Intermediate  Operation  Matrix.  In  addition  to  the  Polygen  Operation  Matrix,  the  Polygen 
Operation  Interpreter  takes  the  polygen  schema  as  an  input  in  order  to  produce  the  Intermediate 
Operation  Matrix.  A  two-pass  Polygen  Operation  Interpreter  algorithm,  pass  one  dealing  with  the 
left-hand  side  and  pass  two  the  right-hand  side  of  polygen  opjerations  has  been  developed  (Wang  & 
Madnick,  1990).   We  illustrate  the  algorithm  below. 
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The  input  to  pass  one  is  a  Polygen  Operation  Matrix  as  Table  1  exemplifies  and  an  empty 
Intermediate  Operation  Matrix.  The  output  from  pass  one  (and  input  to  pass  two)  is  a  half-processed 
Intermediate  Operation  Matrix,  as  shown  in  Table  2.  The  execution  location  (EL)  of  an  of)eration 
depends  on  where  the  data  resides.  Note  that  when  the  execution  location  is  an  LQP  (e.g.,  AD  in  the 
first  row  of  Table  2),  it  is  also  used  as  the  originating  source  tag  for  each  of  the  cell,  c(o),  of  the  polygen 
base  relation  (R(l)  in  this  case). 

Table  2:  A  Half-Processed  lOM  Generated  h\  Pass  One  of  the  POI  Algorithm 


PR 

OP 

LHR 

LHA 

e 

RHA 

RHR 

EL 

R(l) 

Select 

ALUMNUS 

DEG 

= 

"MBA" 

nil 

AD 

R(2) 

Join 

R(l) 

ANAME 

= 

CEO 

PORGANIZATION 

PQP 

R(3) 

Project 

R(2) 

ONA.ME,  CEO 

nil 

ml 

nil 

PQP 

In  this  example,  pass  one  recognizes  that  the  first  row  of  Table  1  contains  the  polygen  relation 
PALUMNUS  whose  attribute  DEGREE  corresponds  to  ((AD,ALUMNUS,DEG)).  Thus,  LS= ALUMNUS, 
LA=DEG,  LD=AD,  and  the  tuple  (R(l),  Select,  ALUMNUS,  DEG,  =,  "MBA",  nil,  AD)  is  inserted  into 
the  first  row  of  Table  2  which  is  empty  initially.  The  second  and  third  row  of  Table  1  are  mapped  into 
Table  2  without  any  change,  and  the  PQP  is  assigned  as  the  execution  location  because  the  left-hand 
relations,  R(l)  and  R(2),  reside  in  the  PQP. 

In  general,  the  left-hand  relation  is  either  a  relation  defined  by  the  polygen  schema  or  a  R(#) 
denoting  a  polygen  base  relation  (or  a  p>olygen  relation  derived  from  other  polygen  base  relations).  In 
the  first  case,  the  left-hand  relation  may  correspond  to  either  one  or  multiple  local  relations.  If  only 
one  local  relation  exists,  then  the  polygen  operation  is  mapped  into  the  local  operation,  and  the 
corresponding  LQP  is  assigned  as  the  execution  location.  If  multiple  local  relations  exist,  then  these 
relations  are  retrieved  and  merged  first  before  the  requested  operation  is  performed  by  the  PQP.  The 
second  case  involves  an  update  of  the  R(#)  from  the  Polygen  Operation  Matrix  to  the  correspxjnding  R(#) 
in  the  half-processed  Intermediate  Operation  Matrix.  In  addition,  the  PQP  is  assigned  as  the  execution 
location  because  R(#)  resides  in  the  PQP. 

Continuing  with  the  example,  pass  two  processes  the  right-hand  side  of  Table  2  and  produces 
Table  3  below. 
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Table  3:  An  Intermediate  Operation  Matrix  for  the 

Example  Polvgen 

Algebraic  Expression 

PR 

OP 

LHR 

LHA 

s 

RHA 

RHR 

EL 

R(l) 

Select 

ALUMNUS 

DEC 

"MBA" 

nil 

AD 

R(2) 

Retrieve 

BUSINESS 

nil 

nil 

ml 

ml 

AD 

R(3) 

Retrieve 

FIRM 

nil 

nil 

nil 

nil 

CD 

R{4) 

Merge 

R(2),  R(3) 

niJ 

ml 

ml 

ml 

PQP 

R(5) 

Join 

R(l) 

ANAME 

= 

CEO 

R(4) 

PQP 

R(6) 

Project 

R(5) 

ONAME,  CEO 

ml 

ml 

nil 

PQP 

The  first  row  of  Table  2  is  copied  over  to  Table  3  directly  because  the  right-hand  relation  is  non- 
existent (nil)  and  no  other  mapping  is  required.  The  second  row  of  Table  2  is  a  Join  between  R(l)  and 
PORGANIZATION  which  corresponds  to  the  BUSINESS  and  FIRM  local  relations.  As  such,  these  two 
relations  are  retrieved  (Row  2-3,  Table  3),  merged  (Row  4,  Table  3),  and  followed  by  a  Join  with  R(l)  of 
Table  2  ~  mapping  to  R(l)  of  Table  3.  Finally,  the  third  row  of  Table  2  maps  to  the  sixth  row  of  Table  3. 

In  general,  three  possibilities  exist  for  the  right-hand  relation:  (1)  a  relation  defined  by  the 
polygen  schema,  (2)  a  R(#)  denoting  a  pjolygen  base  relation  or  a  polygen  relation  derived  from  other 
fxjlygen  base  relations,  and  (3)  non-existent  (nil).  The  second  and  third  cases  follow  the  second  case  of 
pass  one  closely.  The  first  case  is  also  similar  to  pass  one  unless  both  the  left-  and  right-hand  sides 
require  LQP  operations.  That  being  the  condition,  separate  LQP  operations  need  to  be  performed  first 
before  the  requested  polygen  operation  is  performed.  In  addition,  the  polygen  to  local  attribute 
mapping  assigned  in  pass  one  needs  to  be  reversed. 

Finally,  the  Query  Optimizer  examines  the  Intermediate  Operation  Matrix  and  generates  a 
query  execution  plan.  The  issue  of  distributed  query  optimization  has  been  discussed  extensively  in  the 
literature  (Bernstein,  et  al.,  1981;  Ceri  &  Pelagatti,  1984;  Epstein,  Stonebraker,  &  Wong,  1978;  Hevner 
&  Yao,  1979).  There  are  several  aspects  of  query  optimization  in  heterogeneous  database  systems  that 
are  not  present  in  this  distributed  query  optimization  algorithms.  These  are  direct  consequences  of 
heterogeneity  which  is  manifested  in  two  ways:  (1)  the  relative  processing  speeds  -  some  local 
database  management  systems  are  more  optimized  or  based  on  a  faster  hardware  platform,  and  (2) 
their  capabilities  -  for  example,  a  relational  DBMS  normally  has  features  such  as  ORDER  BY,  SUM, 
and  AVG  whereas  a  menu-driven  system  may  not.  The  speeds  and  capabilities  of  the  local  database 
systems  can  be  provided  as  parameters  to  the  query  optimizer  which  would  compare  various 


18 


alternative  execution  plans,  factors  the  differences  in  speeds  into  its  cost  evaluation,  and  insures  that 
all  queries  sent  to  a  local  DBMS  can  be  processed  there  (Dayal,  1983). 

In  this  example,  the  first  two  rows  of  Table  3  are  routed  to  the  Alunnni  Database  (AD)  LQP 
simultaneously  and  the  third  row  to  the  Company  Database  (CD)  LQP.  The  returned  relations  are 
further  processed  by  the  PQP  in  order  to  produce  a  composite  answer.  Note  also  that  the  local  database 
systems  will  most  likely  have  their  own  high-level  query  languages,  such  as  SQL,  with  their  own 
optimization  methods.  As  such,  the  algebraic  expressions  could  be  synthesized  before  sending  to  the 
corresponding  local  database  systems. 

4.     Example  Source  Tagging  in  the  PQP 

We  now  illustrate  the  processing  of  the  example  polygen  query  assuming  the  following  local 
relations  using  Table  3  as  a  query  execution  plan. 


The  Alumnus  Relation  (AD) 


The  Career  Relation  (AD) 


The  Business  Relation  (AD) 


AID# 

ANAME 

DEC 

MAJ 

012 

)ohn  McCaulev 

MBA 

IS 

123 

Bob  Swanson 

MBA 

MCT 

345 

James  Yao 

BS 

EECS 

456 

Dave  Horton 

MBA 

IS 

567 

John  Reed 

MBA 

MCTT 

678 

Bob  Horton 

SF 

MGT 

789 

KenOlsen 

MS 

EE 

AID# 

BNAME 

012 

Citicorp 

123 

Genentech 

345 

Oracle 

456 

Ford 

567 

Citicorp 

678 

BP 

789 

DEC 

BNAME 

IND 

IBM 

Hish  Tech 

CitiCorp 

Banking 

Oracle 

HjRh  Tech 

Ford 

Automobile 

DEC 

Hi^h  Tech 

BP 

Enerjjv 

Genentech 

HiRh  Tech 

The  Firm  Relation  (CD) 


The  Finance  Relation  (CD) 


FNAME 

CEO 

HQ 

AT&T 

Robert  Allen 

NY,  NY 

Banker  s  Trust 

Charles  Santord 

NY,  NY 

GtiCorp 

John  Reed 

NY,  NY 

Ford 

Donald  Peterson 

Dearborn,  Mi 

IBM 

John  Ackers 

Annonk,  NY 

Apple 

John  Scullev 

Cupertino,  CA 

Oracle 

Lawrence  Ellison 

Belmont,  CA 

DEC 

Ken  Olsen 

Mavnard,  MA 

Genentech 

BobSvvanson 

So  San  Franasco,  CA 

FNAME 

YR 

PROFIT 

AT&T 

1989 

-1  7bil 

Banker's  Tnist 

1989 

648  mil 

CitiCorp 

1989 

1.7  bU 

Ford 

1989 

5.3  bU 

IBM 

1989 

5.5  bU 

Apple 

1989 

400  mU 

Oracle 

1989 

43  mil 

DEC 

1989 

1.3  bil 

Genentech 

1989 

21  mil 

The  first  row  of  Table  3  indicates  that  the  operation  ALUMNUS(DEG  =  "MBA")  should  be 
executed  by  the  Alumni  Database  LQP  and  the  result  is  shown  in  Table  4.  Note  that  the  data  source 
cell  is  the  set  (AD)  which  is  taken  directly  from  the  EL  column  in  Table  3.  The  intermediate  source  is  an 
empty  set. 
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Table  4:  Result  of  the  Operation  of  Row  1,  Table  3 

AID« 

ANAME 

DEC 

MAJ 

012,  (AD),  0 

John  McOulev,  (AD),  () 

MBA,  (AD),  n 

IS,  (AD),  0 

123,  (AD),  0 

Bob  Swanson,  (AD),  0 

MBA,  (AD),  n 

MGT  (AD),  0 

456,  (AD),  n 

Dave  Horton,  (AD),  () 

MBA,  IAD),  (] 

IS,  (AD),  0 

567,  (AD),  0 

John  Reed,  (AD),  fl 

MBA,  (AD),  {) 

MIT,  (AD),  0 

The  BUSINESS  and  FIRM  relations  are  retrieved  from  the  Alumni  Database  and  the  Company 
Database  resp>ectively,  then  merged  in  the  PQP.  The  result  is  shown  in  Table  5.  The  Outer  Natural 
Primary  Join,  Outer  Natural  Total  Join,  and  Coalesce  operations  for  generating  Table  5  is  shown  in 
Appendix. 


Table  5:  Result  of  the  Operation 

of  Row  2  through  R4,  Table  3 

ONAME 

INDUSTRY 

CEO 

HEADQUARTERS 

IBM,  (AD,  CD),(AD,  CD) 

High  Tech,  (AD),(AD,  CD) 

)ohn  Ackers,  ((CD), (AD,  CD) 

NY,  (CD), (AD,  CD) 

CidCorp,  (AD  CD), (AD,  CD) 

Banking,,  (AD), (AD,  CD) 

John  Reed,  (CD), (AD,  CD) 

NY,  (CD), (AD,  CD) 

Grade,  (AD,  CD|,(AD,  CD) 

High  Tech,  (AD), (AD,  CD) 

Uwrence  Elhson,  (CD), (AD,  CD) 

CA,  (CDI.IAD,  CD) 

Ford,  (AD,  CD), (AD,  CD) 

Automobile,  (AD), (AD) 

Donald  Peterson,  (CD), (AD,  CD) 

MI,  (CD), (AD,  CD) 

DEC,  (AD,  CD), (AD,  CD) 

High  Tech,  IAD), (AD,  CD) 

Ken  Olsen,  (CD), (AD,  CD) 

MA,  (CD),(AD,  CD) 

BP,  (AD),(AD) 

Energv,  (AD), (AD) 

ml,  n,  (AD) 

ml,  0,  (AD) 

Genentech,  (AD,  CD), (AD,  CD) 

High  Tech,  (AD), (AD) 

Bob  Swanson,  (CD), (AD,  CD) 

CA,  (CD),(AD,  CD) 

ATtT,  (CD), (CD) 

ml,  0,  (CD) 

Robert  Allen,  (CD), (CD) 

NY,  (CD),(CD) 

Bankers  Trust,  (CD), (CD) 

ml,  )),  (CD) 

Charles  Sanford,  (CD), (CD) 

NY,  (CD), (CD) 

Apple,  (CD),(CD) 

ml,  0,  (CD) 

John  Scullev,  (CD), (CD) 

CA  (CD),() 

The  PQP  now  joins  Table  4  with  Table  5  and  produces  Table  6  as  shown  below. 


Table  6:  Result  of  the  Operation  of  Row  5,  Tab 

le3 

AID# 

ANAME 

DEC 

MAI 

ONAME 

INDUSTRY 

CEO 

HEADQUARTERS 

123, 

(AD), 

(AD,  CD) 

Bob  Swanson, 

(AD), 

(AD,  CD) 

MBA, 

(AD), 

'AD,  CD) 

MGT, 

(AD), 

(AD,  CD) 

Genentech, 
(AD,  CD), 
(AD,  CD) 

High  Tech, 

(AD), 
(AD,  CD) 

Bob  Swanson, 

(CD), 

(AD,  CD) 

CA, 

(CD), 

(AD,  CD) 

567, 

(AD), 

(AD,  CD) 

John  Reed, 

(AD), 
(AD,  CD) 

MBA, 

(AD), 

(AD,  CD) 

MIT, 

(AD), 

(AD,  CD) 

Qhcorp, 
(AD,  CD), 
(AD,  CD) 

Banking, 

(AD), (AD, 

CD) 

John  Reed, 

(CD), 
(AD.  CD) 

NY, 

(CD), 

(AD,  CD) 

Finally,  Table  6  is  projected  to  form  Table  7  which  contains  only  those  organizations  and  their 
CEOs  who  graduated  from  MITs  Management  School  with  an  MBA  degree. 

Table  7:  Result  of  the  Operation  of  Row  6,  Table  3 


ONAME 

CEO 

Genentech,  (AD,  CD),  (AD,  CD) 

Bob  Swanson,  (CD), (AD,  CD) 

CiticoiT),  (AD,  CD),  (AD,  CD) 

John  Reed,  (CD), IAD,  CD) 

Several  observations  can  be  made  from  the  example: 

(1)  The  information  of  Genentech  is  from  the  Alumni  Database  and  Company  Database. 

(2)  The  information  that  Genentech's  CEO  is  Bob  Swanson  comes  from  the  Company  Database,  and  the 

Alumni  Database  has  served  as  an  intermediate  source  in  obtaining  the  information. 
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(3)  The  polygen  query  processor  can  derive  the  information  that  Genentech  is  from  the  Alumni 
Database's  BNAME  relation  and  Company  Database's  FNAME  relation  from  the  pKjIygen  schema 
and  the  information  of  (ONAME,  (AD,  CD)).  his  information  can  be  shown  to  the  user  upon 
request  with  a  simple  mapping. 

In  this  simple  example,  the  data  source  information  can  be  obtained  by  inspection  from  the 
polygen  schema.  The  intermediate  source  information  is  not  observable  from  the  polygen  schema.  In  a 
federated  database  system  with  hundreds  of  databases  in  which  a  polygen  query  is  optimized  to  select 
only  the  relevant  databases  for  information  retrieval,  the  data  source  information  observed  from  the 
polygen  schema  is  a  suf)er  set  of  the  result  obtained  by  the  PQP.  We  now  turn  our  attention  to  other 
theoretical  issues  of  source  tagging. 

5.     The  Necessary  and  Sufficient  Condition  of  Source  Tagging 

The  polygen  model  presented  in  Section  2  is  based  on  the  assumption  that  the  source  is  tagged  at 
the  cell  level  after  the  data  has  been  retrieved  from  a  local  database.  Two  fundamental  issues  are 
addressed  in  this  section:  (1)  How  many  other  p>otential  approaches  exist  for  source  tagging?  (2)  Does 
the  closure  property  hold  for  the  polygen  algebra?  (That  is,  does  a  polygen  operation  over  a  set  of 
polygen  relations  always  produce  a  px)lygen  relation?) 

We  address  these  two  issues  through  the  following  lemma  and  theorem.  Specifically,  we  show 
that  although  there  are  four  conceivable  ways  to  tag  sources,  the  closure  property  holds  if  and  only  if 
sources  are  tagged  by  cell. 

(Lemma)  In  extending  the  Relational  Model  to  a  polygen  model,  there  exists  four  ways  to  source 
tagging:  by  cell,  by  tuple,  by  attribute,  and  by  relation. 

Since  the  pxjlygen  model  is  based  on  the  Relational  Model,  the  granularity  of  a  data  object  to  be 
tagged  cannot  be  coarser  than  a  relation  because  a  relation  is  the  basic  unit  of  an  algebraic  operation. 
On  the  other  hand,  the  granularity  cannot  be  finer  than  a  cell  because  a  cell  is  the  smallest  unit  of  a 
relation.  In  addition,  source  tags  are  deleted  or  updated  by  algebraic  operators,  all  of  them  perform 
operations  either  by  tuple  (Cartesian  product,  union,  difference,  and  restrict)  or  by  attribute  {project. 
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coalesce).  It  follows  that  sources  may  be  tagged  by  cell,  by  tuple,  by  attribute,  or  by  relation. 
^Theorem)    The  closure  property  holds  if  and  only  if  source  tagging  is  by  cell. 

Let  E  denote  the  set  of  results  obtained  from  all  possible  combinations  of  algebraic  operations 
defined  in  a  polygen  model.  Let  ei  and  ei'  denote  two  base  p>oIygen  relations.  Let  f  denote  an  algebraic 
operation,  e2  =  f(e])  if  f  is  project,  restrict,  or  coalesce.  62  =  f(ei,  e,')  if  f  is  Cartesian  product,  union,  or 
difference.  Similarly,  let  ey+i  =  He),)  for  some  e^  e  E  if  f  is  project,  restrict,  or  coalesce;  e^t]  =  ((.e^  e^')  for 
some  e^  e  E,  eit'  e  E,  if  f  is  Cartesian  product,  union,  or  difference.  We  now  show  that,  by  induction,  if 
source  tagging  is  by  cell,  then  the  closure  property  holds,  i.e,  e  is  a  polygen  relation  defined  by  the 
polygen  model  V  e  e  E.  Only  the  originating  source  portion  is  shown  below;  the  intermediate  source 
portion  can  be  shown  by  the  same  token.  For  consistency,  we  use  the  notations  developed  in  Section  2. 
(Proof)  Part  1:  The  closure  property  holds  =>  Source  tagging  is  by  cell. 

Suppose  that  the  closure  property  holds  and  source  tagging  is  not  by  cell.  It  follows,  by  the 
Lemma,  that  there  exists  a  polygen  model  in  which  source  tagging  is  by  relation,  by  attribute,  or  by 
tuple.  If  source  tagging  is  by  relation,  then  a  relation  in  this  polygen  model  can  be  expressed  as  (e,  e<o)). 
Consider  the  Cartesian  product  of  (ei ,  e]<o»  x  (e2,  e2<o».  By  definition,  the  operation  yields  (t,  "  t2  :  t^ 
e  ei  and  t2  e  62,  where  °  denotes  concatenation).  However,  the  result  cannot  be  expressed  in  the  form 
of  (e,  e<o))  because  the  originating  source  tags  from  t,  may  be  different  from  t2.  It  follows  that  source 
tagging  by  relation  is  not  feasible.  If  source  tagging  is  by  attribute,  then  an  attribute  in  this  polygen 
model  can  be  expressed  as  (e(x),  e(x)<o)).  Consider  union.  By  definition,  tXd)=ti(d),  t'(xKo)=t,(x)(o>  u 
t2(x)(o)  if  ti(d>=t2(d>  .  However,  the  result  cannot  be  expressed  in  the  form  of  (e(x),  e<x)(o»  because 
ti(x)(o>  may  be  different  from  t2(x)<o).  It  follows  that  source  tagging  by  attribute  is  not  feasible.  If 
source  tagging  is  by  tuple,  then  a  tuple  in  this  polygen  model  can  be  expressed  as  (t,  t(o)).  Consider 
Cartesian  product.  By  the  similar  argument,  the  result  cannot  be  expressed  in  the  form  of  (t,  t<o».  It 
follows  that  source  tagging  by  tuple  is  not  feasible.  By  contradiction,  we  conclude  that  the  proposition 
is  true. 

Part  2:  Source  tagging  is  by  cell  =>  The  closure  property  holds. 

The  premise  that  source  tagging  is  by  cell  justifies  the  usage  of  the  polygen  model  presented  in 
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Section  2  in  the  following  proof.  By  the  model's  definition,  t<o)  is  the  set  of  the  originating  source  V  te 
ej ,  V  ej  6  E,  and  the  closure  property  holds  V  e,  e  E. 

Assuming  that  the  closure  property  holds  for  V  e^  £  E,  we  show  that  the  closure  property  also 
holds  V  ev+i  e  E. 

For  projection,  e^+i  =  e^(X).  Two  cases  need  to  be  considered:  (1)  t(X)(d>  is  unique  and  (2) 
ti(X)<d)=...=  tk(X)(d).  In  the  first  case,  t'(xjXo>  =  t(xj)(o)  V  xj  e  X.  In  the  second  case,  t'(xj)<o>=  ti(xj)<o> 
U...U  tk(xj)<o)  V  xj  6  X.  Since  the  closure  property  holds  for  ti(X)(o>=...=  tk(X){o>,  thus  the  closure 
property  also  holds  for  t'(xj)(o)  V  x;  €  X.   It  follows  that  the  closure  property  holds  for  6^+]  =  etCX). 

For  Cartesian  product,  ey^i  =  (e^  x  e'^ )  =  (t,  °  t2  :  t,  g  e^  and  t2  e  e'^)  where  "denotes 
concatenation.  For  difference,  e^^i  =  (ev.-  e'k)  =  (t  :  t  e  e^,  t<d>«  e'J.  For  restrict,  e^*]  =  ev(x  9  y)  =  ( t :  t 
e  ek  A  (t(x)(d>  8  t(y)<d))).  Since  t(o)  remains  the  same  in  Cartesian  product,  difference,  and  restrict,  it 
follows  that  the  closure  property  holds  for  e^^i  =  (e^  x  e'^ ) ,  e^^i  =  (e^-  e'^)  and  ei-^i  =  e^ix  6  y).  The 
closure  property  holds  for  union  and  coalesce  following  the  similar  arguments.  From  the  Principle  of 
Mathematical  Induction,  we  conclude  that  the  proposition  is  true. 

6.     Summary  and  Conclusions 

We  have  presented  a  polygen  model  for  resolving  the  Data  Source  Tagging  and  Intermediate 
Source  Tagging  problems.  The  polygen  model  research  addresses  issues  in  data  integration  from  the 
"where"  persp)ective  -  a  perspective  that,  to  the  best  of  our  knowledge,  has  not  been  studied  to  date. 
Furthermore,  we  have  presented  a  data-driven  query  translation  mechanism  for  mapping  a  polygen 
algebraic  expression  into  a  set  of  intermediate  polygen  op>erations  dynamically.  A  Prototype,  called 
System  P,  has  been  implemented  (Yuan,  1990)  to  demonstrate  the  feasibility  the  polygen  model  and  the 
polygen  query  processing  capability  presented  in  this  paper. 

This  research  has  also  provided  us  with  a  theoretical  foundation  for  further  investigation  of 
many  other  critical  research  issues  in  heterogeneous  distributed  systems,  for  example  the  cardinality 
inconsistency  problem  which  is  inherent  in  heterogeneous  database  systems.''   It  also  enable  us  to 


Under  the  relational  assumption,  the  cardinality     inconsistency     problem  exists  in  heterogeneous 
database  systems  because  the  referential  integrity  is  not  enforceable  over  multiple  pre-existing 
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interpret  information  from  different  sources  more  accurately.  By  storing  the  metadata  about  each  of  the 
data  sources  in  the  PQP,  many  domain  mismatch,  semantic  reconciliation,  and  data  conflict  problems 
could  be  resolved  systematically  using  the  data  and  intermediate  source  tags.  Furthermore,  other 
polygen  models  can  be  developed  for  heterogeneous  distributed  database  systems  based  on  the  Entity 
Relationship  Model,  the  Functional  Data  Model,  and  the  more  recent  object-oriented  models  (Manola  & 
Dayal,  1986;  Shaw  &  Zdonik,  1990). 

The  data  source  and  intermediate  source  information  can  be  very  valuable  to  the  user  as  well  as 
the  polygen  query  processor  in  formulating  cost-effective,  customized,  and  credible  composite 
information  in  a  federated  database  environment.  As  more  and  more  important  applications  require 
seamless  access  to  and  integration  of  data  from  multiple  heterogeneous  database  systems  both  within 
and  across  organizational  boundaries,  these  capabilities  will  also  become  increasingly  critical. 


databases  which  have  been  developed  and  administered  independently  and  are  likely  to  remain  so. 


24 


Appendix:  The  Operations  that  Generate  Table  5. 

The  second  and  third  row  of  Table  3  indicates  that  the  BUSINESS  and  FIRM  relations  should 
be  retrieved  from  the  Alumni  Database  and  the  Company  Database  respectively.  As  such,  the 
corresponding  data  source  cells  are  the  set  (AD)  and  (CD)  respectively  as  show^n  in  Table  Al  and  Table 
A2  below.  The  intermediate  source  is  an  empty  set  because  no  other  data  sources  have  been  involved  in 
obtaining  these  relations. 


Table  Al:   The  Business  Relation 

BNAME 

IND 

IBM,  |AD!,ii 

High  Tech,  lADl.O 

CitiCorp.  lADhi) 

BankiPR,  (AD),(1 

Oracle,  |ADt,(] 

Hieh  Tech,  lADlD 

Ford,  (ADI.I) 

Automobile,  (AD1,() 

DEC,  (ADl.O 

High  Tech,  (AD),n 

BP,  lADl.O 

Energv,  (AD!,n 

Genentech,  (AD),() 

High  Tech,  (AD),n 

Table  A2:  Tlie  Firm  Relation 


FNAME 

CEO 

HQ 

AT&T,  (CDI,n 

Robert  Allen,  {CD),(1 

NY,  (CDIO 

Barkers  Trust,  (CD|,(! 

Charles  Sanford,  iCDKO 

NY,  (CDKl 

QtiCorp,   ICDKO 

John  Reed,  ICDKO 

NY,  (CD!,() 

Ford,  ICDl,tj 

Donald  Peterson,  |CD),() 

Ml,  ICDKO 

IBM,  (CD),() 

John  Ackers,  (CD1,() 

NY,  (CDKO 

Apple,  ICDIO 

John  Scullev,  |CD),() 

CA.  (CDLO 

Oracle,   ICDH) 

Lawrence  Ellison,  ICD),() 

CA,  ICDKO 

DEC,  (CD),i! 

Ken  Olsen,  (CD),() 

MA,  1CD),|) 

Genentech,  (CD'',() 

BobSwanson,  tCDLO 

CA,  ICDKO 

Table  Al  and  A2  are  merged  together  (see  Row  4,  Table  3)  to  generate  Table  5.  This  process 
involves  an  Outer  Natural  Total  Join  (ONTJ)  of  Table  Al  and  Table  A2.  The  Outer  Natural  Total  Join 
consists  of  three  steps:  (1)  An  outer  join  on  BNAME  and  FNAME  because  they  are  the  local  attributes  of 
the  primary  polygen  attribute  ONAME  for  PORGANLZATION.  The  result  is  shown  in  Table  A3. 
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Table  A3  The  Outer  join  of  Table  Al  and  Table  A2 


BNAME 

IND 

FNAME 

CBO 

HO 

IBM, 
(AD), (AD,  CD) 

High  Tech, 
AD),(AD,  CD) 

IBM, 
(CD), (AD,  CD) 

John  Ackers, 
((CD), (AD,  CD) 

NY, 
(CD),(AD,  CX>) 

CitiCorp, 
(AD),(AD,  CD) 

Banking, 
(AD), (AD,  CD) 

Qticorp, 
(CD), (AD,  CD) 

John  Reed, 

(CD), (AD,  CD) 

NY, 
(CD), (AD,  CD) 

Oracle, 

{AD),(AD,  CD) 

High  Tech, 
(AD),(AD,  CD) 

Oracle, 

(CD), (AD,  CD) 

Lawrence  Ellison, 
(CD), (AD,  CD) 

CA, 

(CD),(AD,  CD) 

Ford, 

fAD),(AD,  CD) 

Automobile, 

(AD),(AD,  CD) 

Ford, 
(CD),  (AD,  CD) 

Donald  Peterson, 

(CD), (AD,  CD) 

Ml, 

(CD),(AD,  CD) 

DEC, 
(AD),(AD,  CD) 

High  Tech, 
(AD),(AD,  CD) 

DEC, 
(CD), (AD,  CD) 

KenOlsen, 
(CD), (AD,  CD) 

MA, 
(CD),(AD,  CD) 

BP,  fAD),(AD) 

Energy,  (AD), (AD) 

ml,    0,  (AD) 

nil,    n,  (AD) 

nU,  (},  (AD) 

Genentech, 

(AD),{AD,  CD) 

High  Tech, 
(AD),(AD,  CD) 

Genentech, 

(CD), (AD,  CD) 

Bob  Swanson, 

(CD), (AD,  CD) 

CA, 

{CD),(AD,  CD) 

nil, 
0,  (CD) 

ml, 

n.  (CD) 

AT&T, 
(CD), (CD) 

Robert  Allen, 
(CD), (CD) 

NY, 
(CD), (CD) 

nil, 
0,  (CD) 

ml, 

n,  (CD) 

Banker  s  Trust, 

(CD), (CD) 

Charles  Sanford, 
(CD), (CD) 

NY, 
(CD), (CD) 

nil, 

n,  (CD) 

ml, 

n,  (CD) 

Apple, 

(CD), (CD) 

John  Sculley, 

(CD), (CD) 

CA 

(CD),() 

(2)  A  Coalesce  of  the  BNAME  and  FNAME  columns  into  the  ONAME  column.  The  result  is 
shown  in  Table  A4.  As  we  defined  in  Section  2,  step  one  and  two  together  are  called  an  Outer  Natural 
Primary  Join. 


Table  A4  The  Outer  Natural  Primarv  Join  of  Table  Al  and  Table  A2 

ONAME 

IND 

CEO 

HQ 

IBM,  (AD,  CD), (AD,  CD) 

High  Tech,  (AD),(AD,  CD) 

John  Ackers,  (ICD), (AD,  CD) 

NY,    CD), (AD,  CD) 

auCorp,  (AD,  CD),(AD,  CD) 

Banking,  (AD), (AD,  CD) 

John  Reed,  (CD), (AD,  CD) 

NY,   CD), (AD,  CD) 

Oracle,  (AD,  CD),{AD,  CD) 

High  Tech,  (AD(,(AD,  CD) 

Lawrence  Blison,  (CD), (AD,  CD) 

CA,  (CD), (AD,  CD) 

Ford,  (AD,  CD), (AD,  CD) 

Automobile,  (AD), (AD  CD) 

Donald  Peterson.  (CD), (AD,  CD) 

MI,  (CD),(AD,  CD) 

DEC,  (AD,  CD),(AD,  CD) 

High  Tech,  (AD), (AD,  CD) 

Ken  Olsen,  ICD), (AD,  CD) 

MA,  (CD),(AD,  CD) 

BP,  (AD), (AD) 

Energy,  (AD), (AD) 

ml,  0,  (AD) 

nil,  0,  (AD) 

Genentech,  (AD,  CD),{AD,  CD) 

High  Tech,  (AD),(AD  CD) 

Bob  Swanson.  (CD). (AD,  CD) 

CA,  (CD),(AD,  CD) 

AT&T,  (CD),(CD) 

ml,  0,  (CD) 

Robert  Allen,  (CD), (CD) 

NY,  (CD), (CD) 

Bankers  Trust,  (CD), (CD) 

nil,  0,  (CD) 

Charles  Sanford,  (CD). (CD) 

NY,  (CD), (CD) 

Apple,  (CD), (CD) 

ml,  0,  (CD) 

John  Scullev,  (CD), (CD) 

CA  (CDl.O 

(3)  Coalesces  of  other  local  columns  into  the  corresponding  non-primary  polygen  columr\s.  Since 
no  other  overlapping  local  columns  exist  in  this  simplified  example,  only  the  local  attributes  IND  and 
HQ  are  changed  to  INDUSTRY  and  HEADQUARTERS.  The  result  is  shown  as  Table  5  in  the  body  of 
the  paper. 
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